Abstract. Distributed strategic learning has been receiving attention in recent years. As systems
become distributed, finding Nash equilibria in a distributed fashion is becoming more important for
various applications. In this paper, we develop a distributed strategic learning framework for seeking
Nash equilibria under stochastic state-dependent payoff functions. We extend the work of Krstic
et al. in [1] to the case of stochastic state-dependent payoff functions. We develop an iterative
distributed algorithm for Nash seeking and examine its convergence to a limiting trajectory defined
by an Ordinary Differential Equation (ODE). We show convergence of our proposed algorithm for
vanishing step size and provide an error bound for fixed step size. Finally, we conduct a stability
analysis and apply the proposed scheme to a generic wireless network. We also present numerical
results which corroborate our claims.

Key words. Stochastic estimation, state-dependent payoff, extremum seeking, sinus perturbation,
Nash equilibrium.

1. Introduction. In this paper we consider a fully distributed system consisting of
non-cooperative nodes, which can be modeled as a non-cooperative game for Nash seeking.
Consider a distributed system with N nodes or agents which interact with one another,
each having a payoff/utility/reward to maximize. The decision or action of each node has
an impact on the reward of the other nodes, which makes the problem challenging in general.
In such systems each node has access to a numerical value of its utility/reward at each
time, but it might not be possible to have a bird's eye view of the system, as it is too
complicated or is constantly changing. Let $a_{j,k}$ be the action of node j at time k,
and let the numerical value of the utility of this node be $\tilde{r}_{j,k}$, where
$\tilde{r}_{j,k} = r_j(S_k, a_k) + \eta_{j,k}$. Here $\eta_{j,k}$ represents noise,
$r_j : \mathcal{S} \times \mathbb{R}^N \to \mathbb{R}$ is the payoff function of node j,
$S_k \in \mathcal{S} \subseteq \mathbb{C}^{N \times N}$ is the state, with $\mathcal{S}$
compact, and $a_k = (a_{1,k}, \ldots, a_{N,k})$ is the action vector containing the actions
of all nodes at time k. Figure 1.1 shows the system model with N interacting nodes. The
rewards are interdependent as the nodes interact with one another. The only assumption
that we can make here is the existence of a local solution. Each node j has access to the
numerical value of its own reward $\tilde{r}_{j,k}$ and needs to implement a scheme to
select an action $a_{j,k}$ such that its utility is maximized. The above scenario can be
interpreted as an interactive game. In this paper we explore learning in such games, which
is synonymous with designing distributed iterative algorithms that converge to the Nash
equilibrium.

Fig. 1.1. Nodes interacting with each other through a dynamic environment

• Different approaches, mainly based on the gradient descent or ascent method [3],
have been developed to achieve a local optimum (or a global optimum in some special
cases, e.g. concavity of the payoff) of the distributed optimization problem.
The method of gradient ascent, also called the steepest ascent method, starts
at a point $a_0$ and, as many times as needed, moves from $a_k$ to $a_{k+1}$ by maximizing
along the line extending from $a_k$ in the direction of $\nabla r_j(S_k, a_k)$, the local
uphill gradient. This gives the iterative scheme
$a_{j,k+1} = a_{j,k} + \lambda_k \nabla r_j(S_k, a_k)$, where $\lambda_k > 0$ is a learning
rate/step size. For this algorithm to be applicable, it is necessary to have access to the
value of $\nabla r_j(\cdot)$ at each time k. In some engineering applications the action
must be non-negative and upper bounded by a maximum value $a_{j,\max} > 0$. The component
$a_{j,k}$ then needs to be projected onto the domain $[0, a_{j,\max}]$, which leads to the
projected gradient descent or ascent algorithm
$a_{j,k+1} = \mathrm{proj}_{[0, a_{j,\max}]}\{a_{j,k} + \lambda_k \nabla r_j(S_k, a_k)\}$,
where proj denotes the projection operator. At each time k, node j needs to
observe/compute the gradient term $\nabla r_j(S_k, a_k)$. Use of the aforementioned
gradient-based method requires knowledge of (i) the system state, (ii) the actions of
others and their states and/or (iii) the mathematical structure (closed-form expression)
of the payoff function. It is therefore difficult for node j to compute the gradient if
the expression for the payoff function $r_j(\cdot)$ is unknown and/or if the states and
actions of other nodes are not observed, since $r_j(\cdot)$ depends on the actions and
states of others.
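
The projected gradient ascent iteration described above can be sketched in a few lines. This is only an illustration under stated assumptions: the scalar payoff r(a) = -(a - 3)^2, the constant step size and the function names are hypothetical and do not come from the paper. It also makes the limitation visible: the method needs the gradient itself, not just the payoff value.

```python
def project(x, lo, hi):
    """Projection of x onto the interval [lo, hi]."""
    return max(lo, min(hi, x))


def projected_gradient_ascent(grad, a0, a_max, step=0.1, iters=200):
    """Iterate a <- proj_[0, a_max](a + step * grad(a)).

    grad is assumed to return the exact gradient of the payoff, which is
    precisely the information a node may not have in practice.
    """
    a = a0
    for _ in range(iters):
        a = project(a + step * grad(a), 0.0, a_max)
    return a


# Hypothetical concave payoff r(a) = -(a - 3)^2, so grad r(a) = -2 (a - 3).
grad_r = lambda a: -2.0 * (a - 3.0)
interior = projected_gradient_ascent(grad_r, a0=0.0, a_max=5.0)  # unconstrained maximizer
clipped = projected_gradient_ascent(grad_r, a0=0.0, a_max=2.0)   # constraint is active
```

With a_max = 5 the iterate approaches the unconstrained maximizer a = 3, while with a_max = 2 the projection keeps it at the boundary.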

• There are several methods for Nash equilibrium seeking where we only have
access to the numerical value of the function at each time and not its gradient (e.g.
complex functions which cannot be differentiated, or unknown functions). Some of
them are detailed below.

The stochastic gradient ascent method feeds back to node j the numerical value of the
gradient $\nabla r_j$ of its reward function (which can be noisy). This presupposes
that a noisy gradient can be computed or is available at each node. If the numerical
values of the gradients of the payoffs are not known by the players, this scheme cannot
be used. In [4] a projected stochastic gradient based algorithm is presented. Distributed
asynchronous stochastic gradient optimization algorithms are presented in [5].
Incremental subgradient methods for non-differentiable optimization have also been
discussed. Distributed optimization algorithms for sensor networks are presented in [7].
Interested readers are referred to a survey by Bertsekas [5] on incremental gradient,
subgradient, and proximal methods for convex optimization. In [5] the authors present
stochastic extremum seeking with applications to mobile sensor networks.

• Krstic et al. have in recent years contributed greatly to the field of non-model
based extremum seeking. In [1], the authors propose a Nash seeking algorithm for
games with continuous action spaces. Their learning algorithm is fully distributed
and requires only a measurement of the numerical value of the payoff. The scheme is
based on sinus perturbation (i.e. deterministic perturbation instead of stochastic
perturbation) of the payoff function in continuous time. However, a discrete time
learning scheme with sinus perturbations is not examined in [1]. In [10] the extremum
seeking algorithm with sinusoidal perturbations for non-model based systems is
extended and modified to the case of i.i.d. noisy measurements and vanishing sinus
perturbation, and almost sure convergence to equilibrium is proved. Sinus perturbation
based extremum seeking for state-independent noisy measurements is presented in
[11]. Krstic et al. [2] have recently extended the Nash seeking scheme to stochastic
non-sinusoidal perturbations. In this paper we extend the work in [1] to the case of
stochastic state-dependent payoff functions, and use deterministic perturbations for
Nash seeking. The difference between this paper and the previous works [10], [11] lies
in the noise model. In these works, the noise $\eta_j$ associated with the measurement
is i.i.d., which does not hold in practice, especially in engineering applications where
the noise is in general time correlated. In our case, we consider a stochastic
state-dependent payoff function, and our problem can be written in Robbins-Monro form
with a Markovian (correlated) noise given by
$\eta_j = r_j(S, a) - \mathbb{E}_S[r_j(S, a)]$ (this will become clearer in the next
sections), i.e. the associated noise is stochastic and state dependent, which is
different from the case of i.i.d. noise.

Although stochastic estimation techniques do estimate the gradient, they introduce
a level of uncertainty. To avoid this, it is possible to use sinus perturbation
instead of stochastic perturbation. This is particularly helpful when one node is
trying to follow the actions of the other nodes in a certain application.

1.1. Contribution. In this paper, we propose a discrete time learning algorithm,
using sinus perturbation, for continuous action games where each node has
only a numerical realization of the payoff at each time. We therefore extend the
classical Nash seeking with sinus perturbation method [1] to the case of discrete time
and stochastic state-dependent payoff functions. We prove that our algorithm converges
locally to a state-independent Nash equilibrium in Theorem 1 for vanishing step size
and provide an error bound in Theorem 2 for fixed step size. Note that since the
payoff function may not necessarily be concave, finding a global optimum at affordable
complexity can be difficult in general, even in the deterministic case (fixed state)
with a known closed-form expression of the payoff. We also give the convergence time
for the sinus framework in Corollary 1. We analyze the algorithm and prove that it
converges to a limiting ODE. We provide the convergence time and the error bound
between our discrete time algorithm and the ODE.

The proofs of the theorems are given in Appendix A.

1.2. Structure of the paper. The remainder of this paper is organized as follows.
Section 2 provides the proposed distributed stochastic learning algorithm. The
performance analysis of the proposed algorithm (convergence to ODE, error bounds)
is presented in Section 3. A numerical example with convergence plots is provided in
Section 4. Section 5 concludes the paper. The Appendix contains the proofs.

1.3. Notations. We summarize some of the notations in Table 1.1.

Table 1.1
Summary of Notations

    Symbol              Meaning
    $\mathcal{N}$       set of nodes
    $\mathcal{A}_j$     set of choices of node j
    $\mathcal{S}$       state space
    $r_j$               payoff of node j
    $a_{j,k}$           decision of j at time k
    $\mathbb{E}$        expectation operator
    $\nabla$            gradient operator

2. Problem Formulation and Proposed Algorithm. Let there be N distributed nodes,
each with a payoff function represented by $r_j(S_k, a_{j,k}, a_{-j,k})$ at time k,
which is used to formulate the following robust problems:

    $\sup_{a_j \ge 0} \mathbb{E}_S\, r_j(S, a_j, a_{-j}) \quad \forall j \in \mathcal{N} := \{1, \ldots, N\}$    (2.1)

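A solution to problem (2.1) is called a state-independent equilibrium solution.

Definition 1 (Nash Equilibrium (state-independent)).
$a^* = (a_j^*, a_{-j}^*) \in \prod_{j'} \mathcal{A}_{j'}$
is a (state-independent) Nash equilibrium point if

    $\mathbb{E}_S\, r_j(S, a_j^*, a_{-j}^*) \ge \mathbb{E}_S\, r_j(S, a_j', a_{-j}^*), \quad \forall a_j' \in \mathcal{A}_j,\ a_j' \ne a_j^*$    (2.2)

where $\mathbb{E}_S$ denotes the mathematical expectation over the state.

Definition 2 (Nash Equilibrium (state-dependent)). We define a state-dependent
strategy $\tilde{a}_j$ of a node j as a mapping from $\mathcal{S}$ to the action space
$\mathcal{A}_j$. The set of state-dependent strategies is
$\mathcal{PG}_j := \{\tilde{a}_j : \mathcal{S} \to \mathcal{A}_j,\ S \mapsto \tilde{a}_j(S) \in \mathcal{A}_j\}$.
A profile $\tilde{a}^* = (\tilde{a}_j^*, \tilde{a}_{-j}^*)$
is a (state-dependent) Nash equilibrium point if

    $\mathbb{E}_S\, r_j(S, \tilde{a}_j^*(S), \tilde{a}_{-j}^*(S)) \ge \mathbb{E}_S\, r_j(S, \tilde{a}_j'(S), \tilde{a}_{-j}^*(S)), \quad \forall \tilde{a}_j' \in \mathcal{PG}_j$    (2.3)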

Here we define $a := (a_j, a_{-j})$. We assume that node j has access to its realized
payoff at each time k, but that the closed-form expression of
$r_j(S_k, a_{j,k}, a_{-j,k})$ is unknown to node j. A solution to the above problem is a
state-independent equilibrium in the sense that no node has an incentive to change its
action when the other nodes keep their choices. It is well known that equilibria can
differ from global optima; the gap between the worst equilibrium and the global
maximizer is captured by the so-called price of anarchy. Thus the solution obtained by
our method can be suboptimal with respect to maximizing the sum of all the payoffs. We
study the local stability of the stochastic algorithm.

The robust game is defined as follows: $\mathcal{N}$ is the set of nodes,
$\mathcal{A}_j$ is the action space of node j, $\mathcal{S}$ is the state space of the
whole system, where $\mathcal{S} \subseteq \mathbb{C}^{N \times N}$, and
$r_j : \mathcal{S} \times \prod_{j' \in \mathcal{N}} \mathcal{A}_{j'} \to \mathbb{R}$ is
a smooth function. It should be mentioned here for clarity that the decisions are taken
in a decentralized fashion by each node.

Games with uncertain payoffs are called robust games. Since the state can be stochastic,
we get a robust game. Here we focus on the analysis of the so-called expected
robust game, i.e. $(\mathcal{N}, \mathcal{A}_j, \mathbb{E}_S r_j(S, \cdot))$. A
(state-independent) Nash equilibrium point [14] of the above robust game is a strategy
profile such that no node can improve its payoff by unilateral deviation; see
Definition 1 and Definition 2.

Since the current state is not observed by the nodes, it is difficult to implement
a state-dependent strategy. Our goal is to design a learning algorithm for a
state-independent equilibrium as given in Definition 1. In what follows we assume that
we are in a setting where the above problem has at least one isolated state-independent
equilibrium solution. More details on the existence of equilibria can be found in Theorem
3 in [S].

2.1. Learning algorithm. Suppose that each node j is able to observe a numerical
value $\tilde{r}_{j,k}$ of the function $r_j(S_k, a_k)$ at time k, where
$a_k = (a_{j,k}, a_{-j,k})$ and $a_{j,k}$ is the action of node j at time k.
$\hat{a}_{j,k}$ is an intermediary variable; $b_j$, $\Omega_j$ and $\phi_j$ represent the
amplitude, frequency and phase of the sinus perturbation signal
$b_j \sin(\Omega_j \hat{k} + \phi_j)$; and $\tilde{r}_{j,k+1}$ represents the payoff at
time k + 1. The learning algorithm is presented in Algorithm 1 and is explained below.
At each time instant k, each node computes its action $a_{j,k}$ by adding the sinus
perturbation $b_j \sin(\Omega_j \hat{k} + \phi_j)$ to the intermediary variable
$\hat{a}_{j,k}$ using equation (2.4), and performs the action $a_{j,k}$. Then each node
gets a realization of the payoff $\tilde{r}_{j,k+1}$ from the dynamic environment at
time k + 1, which is used to compute $\hat{a}_{j,k+1}$ using equation (2.5). The action
$a_{j,k+1}$ is then obtained from equation (2.4). This procedure is repeated for the
window T.

The algorithm is in discrete time and is given by

    $a_{j,k} = \hat{a}_{j,k} + b_j \sin(\Omega_j \hat{k} + \phi_j)$    (2.4)
    $\hat{a}_{j,k+1} = \hat{a}_{j,k} + \lambda_k z_j b_j \sin(\Omega_j \hat{k} + \phi_j)\, \tilde{r}_{j,k+1}$    (2.5)

where $\hat{k} := \sum_{k'=1}^{k} \lambda_{k'}$ and
$\Omega_j + \Omega_{j'} \ne \Omega_{j''}$ for all $j, j', j''$.

For almost sure convergence, it is usual to consider a vanishing step size or learning
rate $\lambda_k$. However, a constant learning rate $\lambda_k = \lambda$ could be more
appropriate in some regimes. The parameter $\phi_j$ belongs to $[0, 2\pi]$ for all j.



Algorithm 1 Distributed learning algorithm
1: Each node j: initialize $\hat{a}_{j,0}$ and transmit
2: Repeat
3:    Calculate action $a_{j,k}$ according to Equation (2.4)
4:    Perform action $a_{j,k}$
5:    Observe $\tilde{r}_{j,k}$
6:    Update $\hat{a}_{j,k+1}$ using Equation (2.5)
7: until horizon T
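
Algorithm 1 can be illustrated with a small simulation. The sketch below is not the paper's experiment: the two-node quadratic payoff, the i.i.d. uniform state, the constant step size (so that $\hat{k} = k\lambda$) and all parameter values are assumptions chosen so that the expected game has the Nash equilibrium (2, 2) in closed form.

```python
import math
import random


def payoff(j, state, actions):
    """Hypothetical state-dependent payoff r_j = -S_j * (a_j - 1 - a_other / 2)^2.

    With E[S_j] = 1 the expected game has best responses a_j = 1 + a_other / 2,
    hence the Nash equilibrium a* = (2, 2).
    """
    other = actions[1 - j]
    return -state[j] * (actions[j] - 1.0 - 0.5 * other) ** 2


def nash_seek(steps=20000, lam=0.02, z=2.0, b=0.3,
              omegas=(5.0, 7.0), phis=(0.0, 0.0), a_hat0=(1.0, 1.0), seed=0):
    """Two-node sketch of updates (2.4)-(2.5) with a constant step size lam."""
    rng = random.Random(seed)
    a_hat = list(a_hat0)
    for k in range(steps):
        t = k * lam  # k-hat: cumulative sum of the (constant) step sizes
        pert = [b * math.sin(omegas[j] * t + phis[j]) for j in range(2)]
        actions = [a_hat[j] + pert[j] for j in range(2)]    # equation (2.4)
        state = [rng.uniform(0.5, 1.5) for _ in range(2)]   # assumed i.i.d. state
        for j in range(2):
            r = payoff(j, state, actions)   # node j only sees this number
            a_hat[j] += lam * z * pert[j] * r               # equation (2.5)
    return a_hat


final_a_hat = nash_seek()
```

Each node uses only its own observed payoff value; neither the gradient, the state, nor the other node's action enters its update. The frequencies are chosen distinct and such that neither is twice the other, in the spirit of the frequency condition stated with (2.4)-(2.5).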



Remark 1 (Learning Scheme in Discrete Time). As we will prove in subsection
3.1, the difference equation (2.5) can be seen as a discretized version of the learning
scheme presented in [1], but for games with state-dependent payoff functions,
i.e., robust games. It should be mentioned here for clarity that the action $a_{j,k}$ of
each node j is scalar.

2.2. Interpretation of the proposed algorithm. In some sense our algorithm
is trying to estimate the gradient of the function $r_j(\cdot)$, although we do not have
access to the function itself but only to its numerical value. The following equation
illustrates the significance of each variable and constant in the algorithm.



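    $\hat{a}_{j,k+1} = \hat{a}_{j,k} + \lambda_k z_j b_j \sin(\Omega_j \hat{k} + \phi_j)\, \tilde{r}_{j,k+1}$    (2.6)

Here $\hat{a}_{j,k+1}$ is the new value, $\hat{a}_{j,k}$ the old value, $\lambda_k$ the
learning rate, $z_j$ the growth rate, $b_j$ the perturbation amplitude, $\Omega_j$ the
perturbation frequency and $\phi_j$ the perturbation phase.

The learning rate $\lambda_k$ can be constant or variable depending on the requirements
of the algorithm and system limitations. The perturbation amplitude $b_j > 0$ is a small
number. $z_j > 0$ is also a small value which can be varied for fine tuning. Rewriting
the above equation we get

    $\frac{\hat{a}_{j,k+1} - \hat{a}_{j,k}}{\lambda_k} = z_j b_j \sin(\Omega_j \hat{k} + \phi_j)\, \tilde{r}_{j,k+1}$    (2.7)

For vanishing step size, $\lambda_k \to 0$ as $k \to \infty$, and the trajectory of the
above algorithm coincides with the trajectory of the ODE in equation (3.11).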
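
Why the right-hand side of (2.7) acts as a gradient estimate can be checked numerically for a single node with a fixed state. The sketch below is an illustration under assumptions (a hypothetical quadratic payoff, one node, uniform sampling of one perturbation period): averaging $b \sin(\theta)\, r(\hat{a} + b \sin\theta)$ over a period gives $(b^2/2)\, r'(\hat{a})$, exactly so for a quadratic payoff.

```python
import math


def averaged_update(payoff, a_hat, b, samples=1000):
    """Average of b*sin(theta) * payoff(a_hat + b*sin(theta)) over one period."""
    total = 0.0
    for n in range(samples):
        theta = 2.0 * math.pi * n / samples
        total += b * math.sin(theta) * payoff(a_hat + b * math.sin(theta))
    return total / samples


# Hypothetical payoff r(a) = -(a - 3)^2 with r'(1) = 4.
r = lambda a: -(a - 3.0) ** 2
estimate = averaged_update(r, a_hat=1.0, b=0.1)
expected = 0.5 * 0.1 ** 2 * 4.0  # (b^2 / 2) * r'(a_hat)
```

The averaged drift has the sign of the payoff gradient, so the intermediary variable moves uphill on average even though only payoff values are used.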

3. Main results. In this section we present the convergence results as introduced
in the contribution section.

We introduce the following assumptions that will be used step by step.

Assumption 1 (A1: Vanishing learning rate). $\lambda_k > 0$,
$\sum_k \lambda_k = \infty$, $\sum_k |\lambda_k|^2 < \infty$. There exists $C_0 > 0$ such
that $P(\sup_k \|a_k\| < C_0) = 1$. The reason for A1 is that $\lambda_k$ represents the
step size of the algorithm, so the sum over all $\lambda_k$ must be infinite for the
algorithm to traverse all of discrete time. The condition
$\sum_k |\lambda_k|^2 < \infty$ ensures a bound on the cumulative noise error. The last
condition (boundedness of the iterates) is for a local stability analysis.

Assumption 2 (A2: Constant learning rate). $\lambda_k = \lambda > 0$,
$\sup_k \mathbb{E}[\|a_k\|^2] < +\infty$ and $\|a_k\|^2$ is uniformly integrable.

Assumption 3 (A3: Existence of a local maximizer).
$\mathbb{E}_S \frac{\partial r_j}{\partial a_j}(S, a^*) = 0$ and
$\mathbb{E}_S \frac{\partial^2 r_j}{\partial a_j^2}(S, a^*) < 0$.
These two conditions tell us that $a_j^*$ is a local maximizer of
$a_j \mapsto \mathbb{E}_S r_j(S, a_j, a_{-j}^*)$,
where $a_{-j}^* = (a_1^*, \ldots, a_{j-1}^*, a_{j+1}^*, \ldots, a_N^*)$.

Assumption 4 (A4: Diagonal Dominance). The expected payoff has a Hessian
that is diagonally dominant at $a^*$, i.e.,

    $\left| \mathbb{E}_S \frac{\partial^2 r_j}{\partial a_j^2}(S, a^*) \right| - \sum_{j' \ne j} \left| \mathbb{E}_S \frac{\partial^2 r_j}{\partial a_j \partial a_{j'}}(S, a^*) \right| > 0.$

Note that A4 implies that the Hessian of the expected payoff is invertible at $a^*$. This
assumption is weaker than the one in the classical extremum seeking algorithm, because
the Hessian of $r_j(S, a^*)$ does not need to be invertible for each S.

We assume that $S \mapsto r_j(S, a)$ is integrable with respect to S so that the
expectation $\mathbb{E}_S r_j(S, a)$ is finite.
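
Assumption A4 is easy to check once the expected second derivatives at a candidate equilibrium are known. The helper below is a hypothetical sketch (the function name and the example matrices are not from the paper); row j is assumed to hold the values $\mathbb{E}_S\, \partial^2 r_j / \partial a_j \partial a_{j'}$ at $a^*$.

```python
def is_diagonally_dominant(hessian):
    """Strict row-wise diagonal dominance: |H[j][j]| > sum over j' != j of |H[j][j']|."""
    for j, row in enumerate(hessian):
        off_diagonal = sum(abs(x) for i, x in enumerate(row) if i != j)
        if abs(row[j]) <= off_diagonal:
            return False
    return True


# Hypothetical two-node example with weak coupling between the nodes.
weak_coupling = [[-2.0, 1.0],
                 [1.0, -2.0]]
# Hypothetical example where node 1 is dominated by the cross term.
strong_coupling = [[-1.0, 2.0],
                   [0.5, -3.0]]
```

A strictly diagonally dominant matrix is nonsingular, which is the invertibility property noted after A4.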

3.1. Convergence to ODE.

Stochastic approximation. First we need to show that our proposed algorithm
converges to the respective ODE almost surely. We use a dynamical system
viewpoint and the stochastic approximation method to analyze our learning algorithm.
The idea consists of finding the asymptotic pseudo-trajectory of the algorithm via an
ordinary differential equation (ODE). To do so, we use the framework initiated by
Robbins-Monro [Hj or [TT]. See [ISIII^ for recent development. The works in P^[T^
allow us to find the limiting trajectory of the learning algorithm.

Note: we do not use A1 and A2 simultaneously.

Our scheme can be written as
$\hat{a}_{j,k+1} = \hat{a}_{j,k} + \lambda_k z_j b_j \sin(\Omega_j \hat{k} + \phi_j)\, \tilde{r}_{j,k+1}$.
We now rewrite the above equation in Robbins-Monro [16] form as
$\hat{a}_{j,k+1} = \hat{a}_{j,k} + \lambda_k \left[ f_j(k, a_k) + M_{k+1} \right]$, where
$f_j(k, a_k) = z_j b_j \sin(\Omega_j \hat{k} + \phi_j)\, \mathbb{E}_S r_j(S, a_k)$ and
$M_{k+1} = z_j b_j \sin(\Omega_j \hat{k} + \phi_j) \left[ \tilde{r}_{j,k+1} - \mathbb{E}_S r_j(S, a_k) \right]$.

Since our payoff $r_j$ is Lebesgue integrable with respect to S, the expectation of the
payoff function $\mathbb{E}_S r_j(S, a)$ is finite. $M_{k+1}$ is adapted to the
filtration $\mathcal{F}_k$ generated by the random variables $S_{k'}$, $k' \le k$, and
the initial law of $a_0$. Moreover $M_{k+1}$ has zero mean. Thus, $M_{k+1}$ is a
martingale difference sequence.

Theorem 1 (Variable Learning Rate). Under Assumption A1, the learning algorithm
converges almost surely to the trajectory of a non-autonomous system given by

    $\frac{d}{dt}\hat{a}_{j,t} = z_j b_j \sin(\Omega_j t + \phi_j)\, \mathbb{E}_S\left(r_j(S, a_t)\right)$
    $a_{j,t} = \hat{a}_{j,t} + b_j \sin(\Omega_j t + \phi_j)$

The gap between the interpolated version of the algorithm and the solution of the
ODE is bounded by

    $\sup_{t \in [t_k, t_k + T]} \|\bar{a}(t) - a^{t_k}(t)\| \le K_{T,t}\, e^{LT} + C_T \lambda_{t+k}$

which vanishes, where $\bar{a}(t)$ is the interpolated version of the algorithm and
$a^{t_k}(t)$ is the solution of the ODE at time t starting from
$t_k := \sum_{k'=0}^{k} \lambda_{k'}$, L is the Lipschitz constant for the ODE and T is
the time window. $K_{T,t}$ is specified below.

In order to calculate the bound we need to define a few terms which are helpful
in obtaining a compact form of the bound.

    $K_{T,t} = C_T L \sum_{k \ge 0} \lambda_{t+k}^2 + \sup_{k \ge 0} \|\delta_{t,t+k}\|$    (3.1)
    $\delta_{t,t+k} = \xi_{t+k} - \xi_t$    (3.2)
    $\xi_t = \sum_{m=0}^{t-1} \lambda_m M_{m+1}$    (3.3)
    $C_T = \|r(0)\| + L\left(C_0 + \|r(0)\| T\right) e^{LT} < \infty$    (3.4)

To prove that the learning algorithm (discrete ODE) converges to the ODE we
need to verify the conditions from Borkar [T^ Chapter 2 Lemma 1.

    $\lim_{s \to \infty} \sup_{t \in [s, s+T]} \|\bar{a}(t) - a^{s}(t)\| = 0$ a.s.

This is an important result as it gives us an approximation of the error between
our algorithm and the corresponding ODE.

Theorem 2 (Fixed Learning Rate). Under Assumption A2, the learning algorithm
converges in distribution, when $\lambda \to 0$, to the trajectory of a non-autonomous
system given by

    $\frac{d}{dt}\hat{a}_{j,t} = z_j b_j \sin(\Omega_j t + \phi_j)\, \mathbb{E}_S\left(r_j(S, a_t)\right)$    (3.5)
    $a_{j,t} = \hat{a}_{j,t} + b_j \sin(\Omega_j t + \phi_j)$    (3.6)

Moreover the error gap is of order $\lambda$. As $\lambda$ converges to zero, the
algorithm converges (in distribution) to the ODE.

The advantage of Theorem 2 compared to Theorem 1 is the convergence time. The
number of iterations required to reach a fixed time T is smaller with a constant learning
rate than with a vanishing learning rate. However, the convergence notion under constant
step size is weaker (it is in distribution) compared to the almost sure convergence with
vanishing learning rate. So there is a tradeoff between almost sure convergence
and convergence time.

Let $\Delta_t$ be the gap between the ODE and the isolated equilibrium at time t.

Theorem 3 (Exponential Stability). Assume A3-A4 and Remark holds.
Then, there exist $M, m > 0$ and $\bar{\epsilon}, \bar{b}_j$ such that, for all
$\epsilon \in (0, \bar{\epsilon})$ and $b_j \in (0, \bar{b}_j)$, if
the initial gap is $\Delta_0$ (which is small) then for all time t,

    $\Delta_t \le y_{1,t}$    (3.7)

where

    $y_{1,t} := M e^{-mt} \Delta_0 + O(\epsilon + \max_j b_j^3)$    (3.8)

Proof. [Sketch of Proof of Theorem 3]

The local stability proof of Theorem 3 follows the steps in [13].

□

From the above equation it is clear that as time goes to infinity the first term in the
bound $y_{1,t}$ vanishes exponentially and the error is bounded by a term determined by
the amplitude of the sinus perturbation, i.e. $O(\epsilon + \max_j b_j^3)$. This means
that the solution of the ODE converges locally exponentially to the state-independent
equilibrium action $a^*$ provided the initial solution is relatively close.

Definition 3 ($\epsilon$-Nash equilibrium payoff point). An $\epsilon$-Nash equilibrium
point in state-independent strategy is a strategy profile such that no node can improve
its payoff by more than $\epsilon$ by unilateral deviation.

Definition 4 ($\epsilon$-close Nash equilibrium strategy point). An $\epsilon$-close Nash
equilibrium point in state-independent strategy is a strategy profile such that the
Euclidean distance to a Nash equilibrium is less than $\epsilon$. An $\epsilon$-close
Nash equilibrium point is an approximate Nash point with a precision at most $\epsilon$.

It is not difficult to see that for Lipschitz continuous payoff functions, an
$\epsilon$-close Nash equilibrium is an $L\epsilon$-Nash equilibrium point, where L is
the Lipschitz constant.

The next corollary shows that one can get an $\epsilon$-close Nash equilibrium in finite
time.

Corollary 1 (Convergence Time). Assume A3-A4 and Remark [5^4] holds.
Then, the ODE reaches a point $(2\epsilon + \max_j b_j^3)$-close to a Nash equilibrium in
at most T time units, where $T = \frac{1}{m} \log\left(\frac{\Delta_0 M}{\epsilon}\right)$.

Proof. [Sketch of Proof for Corollary 1] The proof follows from the inequality
(3.7) in Theorem 3. □
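
The convergence time in Corollary 1 is the time at which the exponential term $M e^{-mt} \Delta_0$ of (3.8) has decayed to $\epsilon$. A minimal numerical sketch, with hypothetical values for the constants M, m, $\Delta_0$ and $\epsilon$ (the paper does not give them):

```python
import math


def convergence_time(M, m, delta0, eps):
    """Smallest T >= 0 with M * exp(-m * T) * delta0 <= eps."""
    if M * delta0 <= eps:
        return 0.0  # the exponential term is already below eps at t = 0
    return math.log(M * delta0 / eps) / m


# Hypothetical constants, for illustration only.
T = convergence_time(M=2.0, m=0.5, delta0=1.0, eps=0.01)
residual = 2.0 * math.exp(-0.5 * T) * 1.0  # value of the exponential term at time T
```

The time grows only logarithmically in $1/\epsilon$ and is inversely proportional to the decay rate m.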

Corollary 2 (Convergence to the ODE). Under Assumptions A1, A3, and A4,
the following inequality holds almost surely:
$\|\bar{a}_t - a^*\| \le y_{1,t} + y_{2,t}$, where

    $y_{2,t} = C_T\left(\lambda_{t+k} + L \sum_{k' \ge 0} \lambda_{t+k'}^2\right) + \sup_{k' \ge 0} \|\delta_{t,t+k'}\|$    (3.9)

Proof. [Proof of Corollary 2] The proof uses the triangle inequality
$\|\bar{a}_t - a^*\| \le \|\bar{a}_t - a_t\| + \|a_t - a^*\|$, where $\bar{a}_t$ denotes
the algorithm and $a_t$ the ODE solution. By Theorem 1 one gets
$\|\bar{a}_t - a_t\| \le y_{2,t}$ and by Theorem 3 one has
$\|a_t - a^*\| \le y_{1,t}$. Combining the two, one arrives at the announced result. □

The constants in equations (3.8) and (3.9) depend on the number of players and
the dimension of the action space.

3.2. Convergence of the stochastic ODE. In this subsection we study the
stochastic ODE given by

    $a_{j,t} = \hat{a}_{j,t} + b_j \sin(\Omega_j t + \phi_j)$    (3.10)
    $\frac{d}{dt}\hat{a}_{j,t} = z_j b_j \sin(\Omega_j t + \phi_j)\, \tilde{r}_{j,t}$    (3.11)

where $\tilde{r}_{j,t}$ is the realization of the state-dependent payoff
$r_j(S_t, a_t)$ at time t. We assume the state process is ergodic so that

    $\lim_{T \to \infty} \frac{1}{T} \int_0^T \mu_j(t)\, r_j(S_t, a_t)\, dt = \lim_{T \to \infty} \frac{1}{T} \int_0^T \mu_j(t)\, \mathbb{E}_S r_j(S, a_t)\, dt$

In particular the asymptotic drifts of the deterministic ODE and the stochastic ODE
are the same. Hence, the following theorem follows:

Theorem 4 (Almost sure exponential stability). The stochastic algorithm
converges asymptotically almost surely to the stochastic ODE in equation (3.11), i.e.

    $P\left(\|a_t - a^*\| \le y_{1,t} + y_{2,t}\right) = 1$

Since the state process is ergodic, we can apply the stochastic averaging theorem
from [2] to get the announced result.

4. Numerical Example: A Generic Wireless Network with Interference.
Even though the distributed optimization problem considered in this paper and the
developed approach are general and can be used in many application domains, we
consider, as an application of the above framework, the problem of power control in
wireless networks in order to better illustrate our contribution. Consider an
interference channel composed of N transmitter-receiver pairs as shown in Figure 4.1.
Each transmitter communicates with its corresponding receiver and causes interference
at the other receivers. Each receiver feeds back a numerical value of the payoff
$\gamma_j(H, p)$ to its corresponding transmitter.

The problem is composed of transmitter-receiver pairs; all of them use the same
frequency and thus generate interference onto each other. Each transmitter-receiver
pair therefore has its own payoff/reward/utility function, which necessarily depends on
the interference exerted by the other pairs/nodes. Since the wireless channel is time
varying, as is the interference, the objective is to optimize the payoff functions of
all the nodes in the long run (e.g. on average). The payoff function of node j
at time k is denoted by $r_j(H_k, p_k)$, where $H_k := [h_k(i,j)]$ represents an
$N \times N$ matrix containing the channel coefficients at time k, $h_k(i,j)$ represents
the channel coefficient between transmitter i and receiver j (where
$(i, j) \in \mathcal{N}^2$) and $p_k$ represents the vector containing the transmit
powers of the N transmit-receive nodes. The most common technique used to obtain a local
maximum of the nodes' payoff functions is the gradient based descent or ascent method.
Remark 2.

Table 4.1
Equivalent Notations for Wireless

    General        Application        Description
    $r_{j,k}$      $\gamma_{j,k}$     utility/payoff of transmitter j at time k
    $a_{j,k}$      $p_{j,k}$          action/power of transmitter j at time k
    $S_{jj',k}$    $g_{jj',k}$        state/channel gain between transmitter j and receiver j' at time k

Fig. 4.1. Interference Channel Model



In Section 3 we proved that our proposed algorithm converges to $p^*$ for any
type of payoff function which satisfies the assumptions in Section 3.1. In order to
show numerically that our algorithm converges to $p^*$, we run it for a simple payoff
function for which the Nash equilibrium $p^*$ can be obtained analytically, and we
compare the convergence point of our algorithm to $p^*$.

The payoff function of node j at time k has then the following form:

    $\gamma_j(H_k, p_k) = \omega \log\left(1 + \frac{p_{j,k}\, g_{jj,k}}{\sigma^2 + \sum_{j' \ne j} p_{j',k}\, g_{j'j,k}}\right) - \kappa\, p_{j,k}$

Here $\omega$ is the bandwidth, the logarithmic term is the rate, and $\kappa p_{j,k}$
is the constraint on powers.

where $\omega$ represents the bandwidth available for transmission. The above payoff
function $\gamma_j(H_k, p_k)$ consists of the log of (1 + SINR) of user j, and the unit
cost of transmission is $\kappa$. It is assumed that a user does not know the structure
of the function $\gamma_j(\cdot)$ or the law of the channel state. For the above payoff
function to ensure the assumption A3-A4
and Remark I3I4I we need to satisfy the condition E|/ijjp > K^ji-^j l^/jP- Please
see appendix for more details. 
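
The payoff above can be evaluated numerically as follows. This is a sketch only: the function name, the gain-matrix layout (gains[i][j] is assumed to be the gain from transmitter i to receiver j), the natural logarithm and the default parameter values are assumptions made for illustration.

```python
import math


def wireless_payoff(j, gains, powers, omega=10.0, kappa=0.9, noise_var=1.0):
    """omega * log(1 + SINR_j) - kappa * p_j for transmitter-receiver pair j.

    gains[i][j] is assumed to be the channel gain from transmitter i to
    receiver j; the default values are illustrative, not the paper's.
    """
    interference = sum(powers[i] * gains[i][j]
                       for i in range(len(powers)) if i != j)
    sinr = powers[j] * gains[j][j] / (noise_var + interference)
    return omega * math.log(1.0 + sinr) - kappa * powers[j]


# Hypothetical two-pair channel: unit direct gains, weak cross gains.
g = [[1.0, 0.1],
     [0.1, 1.0]]
silent = wireless_payoff(0, g, [0.0, 0.0])                      # no power: zero payoff
alone = wireless_payoff(0, g, [1.0, 0.0], omega=1.0, kappa=0.0)  # SINR = 1, rate = log 2
crowded = wireless_payoff(0, g, [1.0, 5.0], omega=1.0, kappa=0.0)
```

In the learning scheme a transmitter would only receive the returned number from its receiver; the formula itself is hidden from it.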

The problem here is to maximize the payoff function $\gamma_j(H, p)$, which is stated as
follows: find $p^*$ such that, for each user $j \in \mathcal{N}$,

    $p_j^* \in \arg\max_{p_j \ge 0} \mathbb{E}\, \gamma_j(H, p_1^*, \ldots, p_{j-1}^*, p_j, p_{j+1}^*, \ldots, p_N^*)$

Note that when $g_{jj} = 0$ the payoff of user j is negative for any positive power, and
the minimum power $p_j^* = 0$ is a solution to the above problem. For the remainder, we
assume that $g_{jj} > 0$.

The channel $h_{jj'}$ is time varying and is generated using an independent and
identically distributed complex Gaussian channel model with variance $\sigma_{jj'}^2$,
such that $\sigma_{jj} = 1$ and $\sigma_{jj'} = 0.1$, $j' \ne j$. The thermal noise is
assumed to be zero-mean Gaussian with variance $\sigma^2$ such that $\sigma^2 = 1$.

We consider the following simulation settings with N = 2 for the above wireless
model: h = 0.9, /ca = 0.9, $\phi_1 = 0$, $\phi_2 = 0$, $\Omega_1 = 0.9$, $\Omega_2 = 1$,
$b_1 = 0.9$, $b_2 = 0.9$.
The numerical setting could be tuned in order to make the convergence slower or
faster, at the cost of some other tradeoff. Due to space limitations, further discussion
on how to select these parameters has been omitted. $p_{1,0}$ and $p_{2,0}$ represent the
starting points
of the algorithm which are initialized as pi o = + 10 and p2.o + = 2 

is the penalty for interference, $\omega = 10$ is the bandwidth and the variance of noise
is normalized. Figure 4.2 shows the average transmit power trajectories of the
algorithm for two nodes. The dotted line represents $p^*$. As can be seen from the plots,
the system converges to $p^*$, where $p_j^* = 3.9604$, $j \in \{1, 2\}$.

Fig. 4.2. Power evolution (discrete time)

The example we discussed is only one of the possible types of applications where 
our proposed algorithm can be implemented. 
Fig. 4.3. Payoff evolution (discrete time)



Consider for example the following payoffs: $q_1(\cdot) = \mathrm{goodput}(\cdot)$ and
$q_2(\cdot) = P(\mathrm{goodput}(\cdot) < \eta)$, where $\eta$ is a small value and
$P(\cdot)$ stands for probability. Goodput represents the ratio of correctly received
information bits to the number of transmitted bits. In wireless communications the
channel is constantly changing due to various physical phenomena, interference from
other sources and changes in the environment. It is hard to have a closed-form
expression for $q_1(\cdot)$ due to the complexity of the transmitter and receiver and to
unknown parameters. In practice, at each time k, the receiver therefore has a numerical
value of goodput but no closed-form expression for rate/goodput, especially for advanced
coding schemes (e.g. turbo codes). $q_2(\cdot)$ represents an outage probability, which
also depends on the goodput. The gradient of $q_2(\cdot)$ is notoriously hard to compute
without knowledge of the channel and interference statistics (probability distribution
function) and a closed-form expression of the goodput. Our scheme can be particularly
helpful in such scenarios.

The price/design parameter $\kappa$ inside the reward function can be tuned such that
the solution of the distributed robust extremum coincides with a global optimizer of
the system designer. $\kappa$ can be the same for all nodes, or each node can have its
own $\kappa_j$. Let $a^*$ represent the optimal action or set of actions to be performed
by each node to maximize their respective utilities. It is possible to set $\kappa$ such
that $a(\kappa) = a^*$. $\kappa$ could be a scalar or a vector depending on the system
size and the application. To be able to effectively make $a(\kappa)$ equal to $a^*$
we need to have enough degrees of freedom in the system. However, this type of tuning
is not possible in general.



5. Concluding remarks. 
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Work Presented: In this paper we have presented a Nash seeking algorithm which is 
able to find the local minima using just the numerical value of the stochastic 
state dependent payoff function at each discrete time sample. We proved 
the convergence of our algorithm to a limiting ODE. We have provided as 
well the error bound for the algorithm and the convergence time to be in 
a close neighborhood of the Nash equilibrium. A numerical example for a 
generic wireless network is provided for illustration. The convergence bounds 
achieved by our method are dependent on the step size and the perturbation 
amplitude. 

New Class of Functions: In this work we introduced a new class of state-dependent
payoff functions $r_j(S, a)$ which are inspired by wireless systems applications.
Functions of this kind are, however, more general and appear in other
application areas.

Achievable Bounds: As is clear from the results in Theorem 1, convergence
depends on an exponential term and the amplitude of the sinus perturbation.
As the amplitude becomes smaller, the error bound also vanishes. In contrast, the
standard stochastic subgradient method depends only on the step size.

Global Analysis: All the work considered in this paper, including Krstic et al.,
considers local stability. Our work is an extension of their work and holds for
local stability. Future work will focus on the extension to the case of
global stability of the Nash equilibrium for both deterministic and stochastic
payoff functions.

Multidimensional Aspect: The presented work has been studied for a scalar reward
and a scalar action for each node. The scalar scenario has several applications to
wireless (as in the aforementioned example) and sensor networks, and numerous
examples can be considered. A possible extension of this work could be
in the direction of vector actions, where each user is able to perform multiple
actions based on multiple rewards.

Appendix A. Convergence Theorems. 

A.1. Variable Step Size: Proof of Theorem 1. Theorem 1 states that,
under Assumption A1, the learning algorithm converges almost surely to the
trajectory of a non-autonomous system given by

$$\frac{d}{dt}\hat{a}_{j,t} = z_j b_j \sin(\Omega_j t + \phi_j)\,\mathbb{E}_S\big[r_j(S, a_t)\big]$$
$$a_{j,t} = \hat{a}_{j,t} + b_j \sin(\Omega_j t + \phi_j)$$

The proof follows in several steps. 

• The first step provides conditions for Lipschitz continuity of the expected
payoff, which is given in Lemma 1. From Lemma 1 we have that, $\forall j, t$,
$f_j(t, a) = b_j z_j \sin(\Omega_j t + \phi_j)\,\mathbb{E}_S\, r_j(S, a)$ is Lipschitz over the domain $\mathcal{D}$.

• Second step: the learning rates are chosen such that they satisfy Assumption
A1.

• Third step: we check the noise conditions. 
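Lemma 1. Let $(S, a) \mapsto r_j(S, a)$ be such that, $\forall S \in \mathcal{S}$, $\exists L_{j,S}$ with

$$(C_1):\ \|r_j(S, a) - r_j(S, a')\| \le L_{j,S}\,\|a - a'\| \quad \forall (a, a')$$
$$(C_2):\ \mathbb{E}_S L_{j,S} < +\infty$$

then the mapping $a \mapsto \mathbb{E}_S\, r_j(S, a)$ is Lipschitz with Lipschitz constant $L_j = \mathbb{E}_S L_{j,S}$.

Proof. [Proof of Lemma 1] Suppose that, for each $S$, $a \mapsto r_j(S, a)$ is Lipschitz with
Lipschitz constant $L_{j,S}$ (condition $C_1$). Then by Jensen's inequality one has

$$\|\mathbb{E}_S r_j(S, a) - \mathbb{E}_S r_j(S, a')\| \le \mathbb{E}_S \|r_j(S, a) - r_j(S, a')\| \le \mathbb{E}_S L_{j,S}\,\|a - a'\|$$

By condition $C_2$, $\mathbb{E}_S L_{j,S} < +\infty$. Let $L_j$ be $\mathbb{E}_S L_{j,S}$. Then

$$\|\mathbb{E}_S r_j(S, a) - \mathbb{E}_S r_j(S, a')\| \le L_j\,\|a - a'\|$$

This completes the proof.

□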
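Lemma 1 can be sanity-checked numerically. The sketch below is only an illustration under assumptions that are not in the paper: a toy payoff $r(S, a) = \log(1 + S a)$ with $a \ge 0$, for which $L_S = S$ is a valid per-state Lipschitz constant, and states drawn from an exponential distribution. The function names are hypothetical.

```python
import math
import random


def payoff(state, action):
    """Toy state-dependent payoff r(S, a) = log(1 + S * a), for a >= 0 (assumed)."""
    return math.log1p(state * action)


def check_expected_lipschitz(num_states=5000, num_pairs=100, seed=0):
    """Compare the largest observed slope of the empirical expected payoff
    with the empirical mean of the per-state Lipschitz constants L_S = S."""
    rng = random.Random(seed)
    # Assumed state distribution: S ~ Exp(1). Since |dr/da| = S / (1 + S a) <= S,
    # the per-state constant in condition (C1) can be taken as L_S = S.
    states = [rng.expovariate(1.0) for _ in range(num_states)]
    mean_lipschitz = sum(states) / num_states  # empirical E_S[L_S]

    def expected_payoff(action):
        return sum(payoff(s, action) for s in states) / num_states

    worst_slope = 0.0
    for _ in range(num_pairs):
        a1 = rng.uniform(0.0, 5.0)
        a2 = rng.uniform(0.0, 5.0)
        if a1 == a2:
            continue
        slope = abs(expected_payoff(a1) - expected_payoff(a2)) / abs(a1 - a2)
        worst_slope = max(worst_slope, slope)
    return worst_slope, mean_lipschitz


if __name__ == "__main__":
    worst, bound = check_expected_lipschitz()
    print("largest observed slope:", worst)
    print("empirical E_S[L_S]:    ", bound)
```

For this toy payoff every sampled slope stays below the empirical mean of $L_S$, which is the content of the lemma applied to the empirical measure.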

Remark 3.

• Note that under $C_1$ and $C_2$ the expected payoff vector $r = (r_j)_{j \in \mathcal{N}}$ is Lipschitz
continuous with $L = \max_j L_j$.

• If $\mathcal{S}$ is a compact set and $S \mapsto L_{j,S}$ is continuous, then $a \mapsto \mathbb{E}_S\, r_j(S, a)$ is
Lipschitz. [In particular, the condition $C_2$ is not needed.]

We shall prove the above remark by reductio ad absurdum. To prove the second
statement of Remark 3 we use a compactness and continuity argument. We start
from the Bolzano-Weierstrass theorem, which implies the following: for any $k$, any
continuous map $a \mapsto f(k, a)$ over a compact set $\mathcal{D}$ has at least one maximum, i.e.,
$\sup_{a \in \mathcal{D}} f(k, a) = \max_{a \in \mathcal{D}} f(k, a) < \infty$. The proof of this statement can be easily
done by contradiction. Suppose $\sup_{a \in \mathcal{D}} f(k, a) = \infty$. Then there exists a sequence
$(a_l)_l$ such that $a_l \in \mathcal{D}$ but $f(k, a_l) \to \infty$ as $l$ goes to infinity. This is impossible
because $\mathcal{D}$ is compact, which implies that $f(k, \mathcal{D}) = \{f(k, a) \mid a \in \mathcal{D}\}$ is bounded by
continuity.

Since $\mathcal{S}$ is compact and $S \mapsto L_{j,S}$ is continuous, $\sup_{S \in \mathcal{S}} L_{j,S}$ is also finite.

Remark 4. If $r_j(S, a)$ is continuously differentiable with respect to $a$, then it
is sufficient to check that the expectation of the gradient is bounded (in norm).

If

• $S$ is in Euclidean space,

• $r_j$ is differentiable w.r.t. $a$,

• $r_j(S, a)$ and $\nabla_a r_j(S, a)$ are continuous in $S$,

• $r_j(S, a)$ and $\nabla_a r_j(S, a)$ are absolutely integrable in $S$ and $\mathbb{E}_S\, r_j(S, a)$ is
continuous in $a$,

then

$$\mathbb{E}[\nabla_a r_j(S, a)] = \nabla_a \mathbb{E}[r_j(S, a)]$$

which can be written as

$$\int_{\mathcal{S}} \nabla_a r_j(S, a)\,\gamma(dS) = \nabla_a \int_{\mathcal{S}} r_j(S, a)\,\gamma(dS)$$

where $\gamma$ is the measure on the state space $\mathcal{S}$. For more details on the above conditions
please refer to [19].
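The interchange of gradient and expectation in Remark 4 can be illustrated numerically. The following sketch assumes, purely for illustration, the toy payoff $r(S, a) = \log(1 + S a)$ and exponentially distributed states (neither is from the paper); it compares the sample mean of the analytic gradient with a central finite difference of the sample mean payoff, using the same state samples for both.

```python
import math
import random


def payoff_gradient(state, action):
    """Analytic derivative of the assumed toy payoff log(1 + S * a) w.r.t. a."""
    return state / (1.0 + state * action)


def gradient_interchange_check(action=1.0, h=1e-4, num_states=5000, seed=1):
    """Return (mean of gradients, gradient of mean) over a fixed state sample."""
    rng = random.Random(seed)
    states = [rng.expovariate(1.0) for _ in range(num_states)]  # assumed S ~ Exp(1)

    # Left-hand side: E[grad_a r(S, a)], estimated by a sample mean.
    mean_of_gradient = sum(payoff_gradient(s, action) for s in states) / num_states

    def mean_payoff(x):
        return sum(math.log1p(s * x) for s in states) / num_states

    # Right-hand side: grad_a E[r(S, a)], estimated by a central difference
    # of the sample mean payoff over the same samples.
    gradient_of_mean = (mean_payoff(action + h) - mean_payoff(action - h)) / (2.0 * h)
    return mean_of_gradient, gradient_of_mean


if __name__ == "__main__":
    lhs, rhs = gradient_interchange_check()
    print("E[grad r] ~", lhs)
    print("grad E[r] ~", rhs)
```

Because both sides use the same samples, the two numbers differ only by the finite-difference and rounding error, not by Monte Carlo noise.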

Since $f_j$ is a function of time and of the actions of the nodes, we need a uniform
Lipschitz condition on $f_j$.
We have

$$|f_j(t, a) - f_j(t, a')| \le b_j z_j\,|\sin(\Omega_j t + \phi_j)|\,\big\|\mathbb{E}_S r_j(S, a) - \mathbb{E}_S r_j(S, a')\big\|$$

But one has $|\sin(\cdot)| \le 1$. Hence,

$$|f_j(t, a) - f_j(t, a')| \le b_j z_j\,\big\|\mathbb{E}_S r_j(S, a) - \mathbb{E}_S r_j(S, a')\big\|$$

We use Lemma 1:

$$|f_j(t, a) - f_j(t, a')| \le b_j z_j L_j\,\|a - a'\|$$

This implies that the Lipschitz constant of $f_j$ is at most that of the expected payoff
$\mathbb{E}_S\, r_j$ times the factor $b_j z_j$.

Finally, we check the noise conditions. The recursion equation is given by

$$\hat{a}_{j,k+1} = \hat{a}_{j,k} + \lambda_k\big[f_j(k, a_k) + M_{j,k+1}\big]$$

where $M_{j,k+1}$ is a martingale difference sequence. By definition the martingale
sequence for the algorithm is given as

$$M_{j,k+1} = z_j b_j \sin(\Omega_j k + \phi_j)\big[\tilde{r}_{j,k+1} - \mathbb{E}_S[r_{j,k+1}(S, a_{k+1})]\big]$$

which satisfies the condition $\mathbb{E}[M_{k+1} \mid \mathcal{F}_k] = 0$ for $k \ge 0$ almost surely (a.s.).

Lemma 2. If $a_k \in \mathcal{D}$ then the martingale is square-integrable with

$$\mathbb{E}\big[\|M_{k+1}\|^2 \mid \mathcal{F}_k\big] \le c\,(1 + \|a_k\|^2) \quad \forall k$$
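Proof. [Proof of Lemma 2]

Let $\tilde{r}_{j,k+1}$ be the realization of the payoff at time $k + 1$. The expected value of this
random variable can be bounded above by the norm of $a_k$.

$$\begin{aligned}
M_{j,k+1} &= z_j b_j \sin(\Omega_j k + \phi_j)\big(\tilde{r}_{j,k+1} - \mathbb{E}_S[r_{j,k+1}(S, a_{k+1})]\big) \\
\|M_{j,k+1}\| &\le |z_j|\,|b_j|\,|\sin(\Omega_j k + \phi_j)|\,\big\|\tilde{r}_{j,k+1} - \mathbb{E}_S[r_{j,k+1}(S, a_{k+1})]\big\| \\
&\le |z_j b_j|\,\big(\|\tilde{r}_{j,k+1}\| + \|\mathbb{E}_S[r_{j,k+1}(S, a_{k+1})]\|\big) \\
&\le |z_j b_j|\,\big(\|\tilde{r}_{j,k+1}\| + \mathbb{E}_S\|r_{j,k+1}(S, a_{k+1})\|\big) \\
&\le z b\,\big(\|\tilde{r}_{j,k+1}\| + \mathbb{E}_S\|r_{j,k+1}(S, a_{k+1})\|\big)
\end{aligned}$$

where $|\sin(\cdot)| \le 1$, $z = \max_j |z_j|$ and $b = \max_j |b_j|$. The term $\|\tilde{r}_{j,k+1}\|$ is bounded because of
the Lipschitz condition $C_1$, as shown below.

$$\begin{aligned}
\|r_j(S, a_k) - r_j(S, 0)\| &\le L_{j,S}\,\|a_k - 0\| \quad \forall a_k \qquad (A.1) \\
\|r_j(S, a_k)\| &\le \|r_j(S, 0)\| + L_{j,S}\,\|a_k\| \\
&\le \beta_{1,S} + L_{j,S}\,\|a_k\|
\end{aligned}$$

where $\beta_{1,S} = \|r_j(S, 0)\|$. The inequalities (A.1) show that $\|r_j(S, a)\|$ is bounded
by $\beta_{1,S} + L_{j,S}\,\|a\|$. Taking the expectation of the above set of inequalities we get

$$\begin{aligned}
\mathbb{E}_S\|r_j(S, a_k)\| &\le \mathbb{E}_S\|r_j(S, 0)\| + \mathbb{E}_S L_{j,S}\,\|a_k\| \qquad (A.2) \\
&\le L_j\,\|a_k\| + \mathbb{E}_S\|r_j(S, 0)\| \\
&\le L_j\,\|a_k\| + \beta_2
\end{aligned}$$

where $\beta_2 = \mathbb{E}_S\|r_j(S, 0)\|$ and $L_j = \mathbb{E}_S L_{j,S}$. The inequalities (A.2) show that
$\mathbb{E}_S\|r_j(S, a)\|$ is bounded.

Combining the inequalities (A.1) and (A.2) we get

$$\begin{aligned}
\|M_{j,k+1}\|^2 &\le z^2 b^2\,\big(\beta_{1,S} + L_{j,S}\,\|a_k\| + L_j\,\|a_k\| + \beta_2\big)^2 \\
&\le 4 z^2 b^2\,\big(\beta_{1,S}^2 + \beta_2^2 + (L_{j,S}^2 + L_j^2)\,\|a_k\|^2\big)
\end{aligned}$$

Taking $\mathbb{E}_S$ over the above inequalities we get:

$$\begin{aligned}
\mathbb{E}\|M_{j,k+1}\|^2 &\le 4 z^2 b^2\,\big(\mathbb{E}_S\beta_{1,S}^2 + \beta_2^2 + (\mathbb{E}_S L_{j,S}^2 + L_j^2)\,\|a_k\|^2\big) \\
&\le 4 z^2 b^2\,\big(\beta + \bar{L}_j\,\|a_k\|^2\big) \\
&\le c\,(1 + \|a_k\|^2)
\end{aligned}$$

where $\bar{L}_j = \mathbb{E}_S L_{j,S}^2 + L_j^2$, $\beta = \mathbb{E}_S\beta_{1,S}^2 + \beta_2^2$ and $c \ge 4 z^2 b^2 (\beta + \bar{L}_j)$.
This completes the proof. □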
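The factor 4 in the bound on $\|M_{j,k+1}\|^2$ comes from a standard inequality for the square of a sum of four terms. A brief derivation follows; $x_1, \dots, x_4$ are shorthand introduced here for the four nonnegative terms and are not notation from the paper.

```latex
\begin{align*}
(x_1 + x_2 + x_3 + x_4)^2
  &= \Big(\sum_{i=1}^{4} 1 \cdot x_i\Big)^2
  \le \Big(\sum_{i=1}^{4} 1^2\Big)\Big(\sum_{i=1}^{4} x_i^2\Big)
  = 4 \sum_{i=1}^{4} x_i^2
  && \text{(Cauchy-Schwarz)} \\
\text{with}\quad
  & x_1 = \beta_{1,S},\quad
    x_2 = L_{j,S}\,\|a_k\|,\quad
    x_3 = L_j\,\|a_k\|,\quad
    x_4 = \beta_2 .
\end{align*}
```

Substituting these four terms gives $\beta_{1,S}^2 + \beta_2^2 + (L_{j,S}^2 + L_j^2)\,\|a_k\|^2$ inside the bracket, multiplied by $4 z^2 b^2$.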

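We now combine the above three steps to derive almost sure convergence to an
ODE. To do so, we interpolate the stochastic process (an affine interpolation) in
order to get a continuous time process, following the lines of Borkar [12], Chapter 2,
Lemma 1. The gap between the solution of the non-autonomous differential equation
given by

$$\frac{d}{dt}\bar{a}_t = f(t, \bar{a}_t)$$

and the interpolated process vanishes almost surely over any asymptotic interval of
length $T > 0$:

$$\lim_{t \to \infty}\ \sup_{q \in [t, t+T]} \|\tilde{a}_q - \bar{a}^{t}_q\| = 0 \quad \text{a.s.}$$

where $\tilde{a}$ denotes the interpolated process and $\bar{a}^{t}$ the solution of the differential
equation that coincides with $\tilde{a}$ at time $t$.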

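In order to calculate the bound we need to define a few terms which are helpful in
obtaining a compact form of the bound.

$$\sup_{t \in [t_n, t_n + T]} \|\tilde{a}(t) - \bar{a}^{t_n}(t)\| \le K_{T,n}\,e^{LT} + C_T \sup_{k \ge 0} \lambda_{n+k}$$

where

$$K_{T,n} = C_T L \sum_{k \ge 0} \lambda_{n+k}^2 + \sup_{k \ge 0} \|\delta_{n,n+k}\|$$

$$\delta_{n,n+k} = \zeta_{n+k} - \zeta_n, \qquad \zeta_n = \sum_{m=0}^{n-1} \lambda_m M_{m+1}$$

$$C_T = \|r(0)\| + L\,(C_0 + \|r(0)\|\,T)\,e^{LT} < \infty$$

$$\mathbb{P}\Big(\sup_n \|a_n\| \le C_0\Big) = 1$$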
then we conclude by a discrete adaptation of Lemma 1 in Borkar [12].

A.2. Fixed Step Size: Proof of Theorem 2. Theorem 2 states that, under
Assumption A2, the learning algorithm converges in distribution to the trajectory of
a non-autonomous system given by

$$\frac{d}{dt}\hat{a}_{j,t} = z_j b_j \sin(\Omega_j t + \phi_j)\,\mathbb{E}_S\big[r_j(S, a_t)\big]$$
$$a_{j,t} = \hat{a}_{j,t} + b_j \sin(\Omega_j t + \phi_j)$$

Proposition 1. Let $\tilde{a}_t$ be the interpolated version of the trajectory of our algorithm
at time $t$ and let $\bar{a}_t$ be the trajectory of the ODE at time $t$. Under Assumption A2, $\tilde{a}_t$
converges to $\bar{a}_t$ as the step size vanishes:

$$\mathbb{E}\Big[\sup_{t \in [0, T]} \|\tilde{a}_t - \bar{a}_t\|^2\Big]^{1/2} \le C_T \sqrt{\lambda}$$

Proposition 1 implies Theorem 2.

Proof. [Proof of Proposition 1] To prove the above proposition we start with a
fixed step size $\lambda > 0$.

• Time scale: $T_t = \sum_{k=1}^{t} \lambda = t\lambda$, for $t \ge 0$.

• The cumulative noise at iteration $t$: $\xi_t = \sum_{k=1}^{t} \lambda M_{k+1} = \lambda \sum_{k=1}^{t} M_{k+1}$.

• Define the (affine) interpolated process from $\{\hat{a}_k\}_{k \ge 0}$, rewritten as

$$\hat{a}_{j,k+1} = \hat{a}_{j,k} + \lambda\big(f_j(k, a_k) + M_{k+1}\big).$$

The advantage of the interpolated process is that it is defined for any continuous
time by concatenation. The affine interpolation writes
$\tilde{a}_{j,t} = \hat{a}_{j,k} + \frac{t - k\lambda}{\lambda}\,(\hat{a}_{j,k+1} - \hat{a}_{j,k})$ if $t \in [k\lambda, (k+1)\lambda[$, which is now in continuous time.
Note that a constant learning rate or constant step size $\lambda_t = \lambda$ is suitable for many
practical scenarios. It is used for example in numerical analysis: Euler's scheme (1st
order), Runge-Kutta's scheme (4th order), etc. Our algorithm writes

$$(**)\qquad \hat{a}_{j,k+1} = \hat{a}_{j,k} + \lambda\,b_j z_j \sin(k\Omega_j + \phi_j)\,\tilde{r}_{j,k+1}$$

where $\lambda$ is a constant learning rate; our aim is to analyze (**) asymptotically when $\lambda$
is very small. In order to prove an asymptotic pseudo-trajectory result for a constant
learning rate, we need additional assumptions on the sequence generated by the powers.
The key additional assumption is the uniform integrability of that process. We need
the conditions $C_1$ and $C_2$, which translate into

- From Remark 4, the gradient of the expectation of the payoff is bounded

- From Lemma 2, the square of the martingale is bounded

- Uniform integrability of $r_j(S, a)$

and $\bar{a}_t$ is the solution of $\frac{d}{dt}\bar{a}_t = f(t, \bar{a}_t)$ starting from $a_{\lfloor t \rfloor}$.
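To make the fixed-step recursion (**) concrete, the following is a minimal Python sketch on a hypothetical two-node game. Everything specific in it is an assumption made for illustration and is not taken from the paper: the quadratic payoff, the uniformly distributed state, the gains, amplitudes, frequencies and phases, and the function name `run_nash_seeking`. Each node only uses the measured numerical value of its own payoff.

```python
import math
import random


def run_nash_seeking(num_iters=20000, step=0.01, seed=0):
    """Sinus-perturbation Nash seeking with a fixed step size on a toy game.

    Assumed payoff of node j (not from the paper):
        r_j(S, a) = -S * (a_j - offset_j - coupling * a_other)^2
    with state S ~ Uniform(0.8, 1.2), so E[S] = 1. The Nash equilibrium of the
    expected-payoff game solves a_1 = 1 + 0.3 a_2 and a_2 = 2 + 0.3 a_1,
    i.e. a_1 = 1.6 / 0.91 and a_2 = 2 + 0.3 * a_1.
    """
    rng = random.Random(seed)
    gains = [10.0, 10.0]     # z_j (assumed values)
    amps = [0.2, 0.2]        # b_j, perturbation amplitudes (assumed values)
    freqs = [1.0, 1.7]       # Omega_j, distinct frequencies (assumed values)
    phases = [0.0, 0.5]      # phi_j (assumed values)
    offsets = [1.0, 2.0]
    coupling = 0.3

    a_hat = [0.0, 0.0]       # unperturbed action estimates
    for k in range(num_iters):
        probes = [math.sin(freqs[j] * k + phases[j]) for j in range(2)]
        # Played actions: estimate plus sinus perturbation.
        actions = [a_hat[j] + amps[j] * probes[j] for j in range(2)]
        state = rng.uniform(0.8, 1.2)
        for j in range(2):
            other = actions[1 - j]
            # Node j observes only this number, not its gradient.
            measured = -state * (actions[j] - offsets[j] - coupling * other) ** 2
            # Recursion (**): a_hat += lambda * b_j * z_j * sin(.) * measured payoff.
            a_hat[j] += step * amps[j] * gains[j] * probes[j] * measured
    return a_hat


if __name__ == "__main__":
    estimate = run_nash_seeking()
    nash_1 = 1.6 / 0.91
    nash_2 = 2.0 + 0.3 * nash_1
    print("estimates:", estimate)
    print("Nash equilibrium of the toy game:", (nash_1, nash_2))
```

In this sketch the averaged drift of each estimate is proportional to the partial derivative of the node's own expected payoff, which is why the estimates settle near the equilibrium of the toy game; the residual oscillation shrinks with the step size and the perturbation amplitude.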

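Summing the recursion over $m$ iterations, the interpolated process satisfies

$$\tilde{a}_{j,T_{t+m}} = \sum_{k=1}^{m} (T_{t+k} - T_{t+k-1})\, f_j\Big(\Big\lfloor \frac{T_{t+k-1}}{\lambda} \Big\rfloor,\ a_{\lfloor T_{t+k-1}/\lambda \rfloor}\Big) + (\xi_{t+m} - \xi_t) + \tilde{a}_{j,T_t}$$

which can be written in integral form as

$$\tilde{a}_{j,T_{t+m}} = \sum_{k=1}^{m} \int_{T_{t+k-1}}^{T_{t+k}} f_j\Big(\Big\lfloor \frac{s}{\lambda} \Big\rfloor,\ a_{\lfloor s/\lambda \rfloor}\Big)\,ds + (\xi_{t+m} - \xi_t) + \tilde{a}_{j,T_t}$$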

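Now we use Burkholder's inequality, which states the following: for an $\alpha \ge 1$
there exist two constants $c_1 > 0$ and $c_2 > 0$ such that

$$c_1\,\mathbb{E}\Big[\Big(\sum_{k=1}^{t} \|a_k - a_{k-1}\|^2\Big)^{\alpha/2}\Big] \le \mathbb{E}\Big[\sup_{k \le t} \|a_k\|^{\alpha}\Big] \le c_2\,\mathbb{E}\Big[\Big(\sum_{k=1}^{t} \|a_k - a_{k-1}\|^2\Big)^{\alpha/2}\Big]$$

$$c_1\,\mathbb{E}\Big[\Big(\sum_{k=1}^{t} \|\eta_k - \eta_{k-1}\|^2\Big)^{\alpha/2}\Big] \le \mathbb{E}\Big[\sup_{k \le t} \|\eta_k\|^{\alpha}\Big] \le c_2\,\mathbb{E}\Big[\Big(\sum_{k=1}^{t} \|\eta_k - \eta_{k-1}\|^2\Big)^{\alpha/2}\Big]$$

Take $\eta_t = \lambda \sum_{k=1}^{m} \|M_{t+k}\|^2$. We then use the discrete Gronwall inequality, which states
that if $e_t \ge 0$ for all $t \ge 0$ and

$$e_{t+1} \le C + L \sum_{k=0}^{t} \lambda\,e_k$$

then

$$e_{t+1} \le C\,e^{L\lambda(t+1)}$$

Here

$$e_k = \mathbb{E}\Big[\sup_{k' \le k} \|\tilde{a}_{T_{t+k'}} - \bar{a}_{T_{t+k'}}\|^2\Big]^{1/2}$$

$$C = \lambda T K_1 \sqrt{1 + C_0^2} + \sqrt{\lambda}\,K_2\,(1 + C_0^2), \qquad L = \max_{j \in \mathcal{N}} \mathbb{E}_S[L_{j,S}]$$

for some $K_1$ and $K_2 = c_2$, with

$$K_1 = \max\big(c_1,\ c_2\sqrt{1 + C_0^2}\big)$$

From the above we deduce that

$$\mathbb{E}\Big[\sup_{k' \le k} \|\tilde{a}_{T_{t+k'}} - \bar{a}_{T_{t+k'}}\|^2\Big]^{1/2} \le \sqrt{\lambda}\,C_T$$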

This shows that $\mathbb{E}\big[\sup_{k' \le k} \|\tilde{a}_{T_{t+k'}} - \bar{a}_{T_{t+k'}}\|^2\big]^{1/2}$ is bounded and implies
Proposition 1. When $\lambda \to 0$ we have weak convergence of the interpolated process to a
solution of the ODE. The error gap is $\sqrt{\lambda}\,C_T$, which vanishes as $\lambda \to 0$.
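The discrete Gronwall step can be checked numerically in its worst case, where the hypothesis $e_{t+1} \le C + L \sum_{k=0}^{t} \lambda e_k$ holds with equality and the conclusion reads $e_{t+1} \le C e^{L\lambda(t+1)}$. The sketch below uses arbitrary illustrative constants (assumptions, not values from the paper) and a hypothetical function name.

```python
import math


def gronwall_check(c=1.0, lip=2.0, step=0.01, horizon=500):
    """Build the extremal sequence e_{t+1} = C + L * sum_{k<=t} step * e_k
    (starting from e_0 = C) and compare it with C * exp(L * step * (t + 1))."""
    errors = [c]              # e_0 = C, an assumed initial value
    running_sum = 0.0         # accumulates step * e_k for k = 0..t
    holds = True
    bound = c
    for t in range(horizon):
        running_sum += step * errors[t]
        next_error = c + lip * running_sum            # hypothesis with equality
        bound = c * math.exp(lip * step * (t + 1))    # Gronwall bound
        holds = holds and (next_error <= bound)
        errors.append(next_error)
    return holds, errors[-1], bound


if __name__ == "__main__":
    ok, last_error, last_bound = gronwall_check()
    print("bound holds at every step:", ok)
    print("e_T =", last_error, " bound =", last_bound)
```

In the extremal case the sequence grows like $C(1 + L\lambda)^{t}$, which stays below the exponential bound because $1 + x \le e^{x}$.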



Appendix B. Conditions for our Example. 

Following are some details about how to obtain a* for our application. 



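$$\bar{g}_{i,j} = \mathbb{E}_G\, g_{i,j} = \mathbb{E}_G |h_{i,j}|^2$$

From Remark 4 we can write $\mathbb{E}_G \frac{\partial r_j(G, a^*)}{\partial a_j} = \frac{\partial}{\partial a_j}\mathbb{E}_G\, r_j(G, a^*) = 0$. Solving the $N$
equations we have the following matrix form.

$$\alpha = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N \end{pmatrix}, \qquad
G = \begin{pmatrix}
\bar{g}_{1,1} & \bar{g}_{1,2} & \cdots & \bar{g}_{1,N} \\
\bar{g}_{2,1} & \bar{g}_{2,2} & \cdots & \bar{g}_{2,N} \\
\vdots & \vdots & \ddots & \vdots \\
\bar{g}_{N,1} & \bar{g}_{N,2} & \cdots & \bar{g}_{N,N}
\end{pmatrix}$$

The above equation can be written in the compact form as

$$a^* = G^{-1}\alpha$$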
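As an illustration of the compact form, the sketch below solves the linear system $G a^* = \alpha$ by Gaussian elimination with partial pivoting and reports a failure when the matrix is numerically singular. The matrix and right-hand side are made-up values chosen only for the example, and `solve_linear_system` is a hypothetical helper, not part of the paper.

```python
def solve_linear_system(matrix, rhs, tol=1e-12):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting.

    Raises ValueError if a pivot is smaller than tol in absolute value,
    i.e. if the matrix is singular to working precision.
    """
    n = len(matrix)
    # Augmented matrix [matrix | rhs], copied so the inputs are not modified.
    aug = [[float(v) for v in matrix[i]] + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < tol:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    # Back substitution.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        acc = aug[row][n] - sum(aug[row][c] * solution[c] for c in range(row + 1, n))
        solution[row] = acc / aug[row][row]
    return solution


if __name__ == "__main__":
    # Made-up mean gain matrix (strong direct gains, weak cross gains)
    # and a made-up positive right-hand side.
    mean_gains = [[1.0, 0.1, 0.2],
                  [0.1, 1.0, 0.1],
                  [0.2, 0.1, 1.0]]
    alpha = [1.0, 1.0, 1.0]
    print("a* =", solve_linear_system(mean_gains, alpha))
```

For a matrix whose diagonal entries dominate the off-diagonal ones, as in this example, the elimination never meets a zero pivot, which mirrors the invertibility requirement on $G$.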

$G$ should be invertible and all the elements of the vector $\alpha$ should be strictly
positive, as they are a linear combination of power and gains, which are positive. We
can also write $\omega\,\bar{g}_{j,j} > \kappa\,\sigma^2$. For this example we can write

If this condition is satisfied then $G$ is invertible.

As $G$ is a matrix of random channel gains it is almost surely invertible. To show
the invertibility of this matrix we just need to show that $\det(G) \neq 0$.